condition image
Unified Text-Image-to-Video Generation: A Training-Free Approach to Flexible Visual Conditioning
Lai, Bolin, Lee, Sangmin, Cao, Xu, Li, Xiang, Rehg, James M.
Text-image-to-video (TI2V) generation is a critical problem for controllable video generation using both semantic and visual conditions. Most existing methods add visual conditions to text-to-video (T2V) foundation models by finetuning, which is costly in resources and limited to a few pre-defined conditioning settings. To tackle these constraints, we introduce a unified formulation for TI2V generation with flexible visual conditioning. Furthermore, we propose an innovative training-free approach, dubbed FlexTI2V, that can condition T2V foundation models on an arbitrary number of images at arbitrary positions. Specifically, we first invert the condition images to noisy representations in a latent space. Then, in the denoising process of T2V models, our method uses a novel random patch swapping strategy to incorporate visual features into video representations through local image patches. To balance creativity and fidelity, we use a dynamic control mechanism to adjust the strength of visual conditioning for each video frame. Extensive experiments validate that our method surpasses previous training-free image conditioning methods by a notable margin. Our method also generalizes to both UNet-based and transformer-based architectures.
- North America > United States > Illinois > Champaign County > Urbana (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- Leisure & Entertainment (0.68)
- Media (0.46)
- Transportation (0.46)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language (0.88)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
- Europe > Belgium > Flanders > Flemish Brabant > Leuven (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- Europe > Switzerland > Zürich > Zürich (0.04)
- Europe > Germany > Saarland (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Asia > China > Guangdong Province > Guangzhou (0.04)
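As a rough illustration of the FlexTI2V mechanism described above, the sketch below swaps random latent patches from inverted condition images into the noisy video latents at one denoising step, with a per-frame swap ratio that decays with distance from the conditioned frame as a stand-in for the paper's dynamic control mechanism. Tensor shapes, the patch size, and the decay schedule are assumptions for illustration, not the authors' implementation.

```python
import torch

def random_patch_swap(video_latents, cond_latents, cond_positions,
                      patch=4, base_ratio=0.3, decay=0.5):
    """Swap random latent patches from condition images into noisy video latents.

    video_latents: (F, C, H, W) video latents at the current denoising step.
    cond_latents:  dict {frame_idx: (C, H, W)} inverted latents of the condition images.
    cond_positions: indices of the conditioned frames.
    """
    n_frames, _, H, W = video_latents.shape
    gh, gw = H // patch, W // patch
    out = video_latents.clone()
    for f in range(n_frames):
        # Use the nearest condition image; weaken the swap with distance from it.
        nearest = min(cond_positions, key=lambda p: abs(f - p))
        ratio = base_ratio * (decay ** abs(f - nearest))
        n_swap = int(ratio * gh * gw)
        if n_swap == 0:
            continue
        src = cond_latents[nearest]
        for i in torch.randperm(gh * gw)[:n_swap].tolist():
            r, c = (i // gw) * patch, (i % gw) * patch
            out[f, :, r:r + patch, c:c + patch] = src[:, r:r + patch, c:c + patch]
    return out
```

For example, with 16 frames of (4, 32, 32) latents and one condition image inverted to a (4, 32, 32) latent placed at frame 0, calling random_patch_swap(lat, {0: img_lat}, [0]) injects patches most strongly into the first frame and progressively less into later ones.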
RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models
Lin, Yijing, Huang, Mengqi, Zhuang, Shuhan, Mao, Zhendong
Unifying diverse image generation tasks within a single framework remains a fundamental challenge in visual generation. While large language models (LLMs) achieve unification through task-agnostic data and generation, existing visual generation models fail to meet these principles. Current approaches either rely on per-task datasets and large-scale training or adapt pre-trained image models with task-specific modifications, limiting their generalizability. In this work, we explore video models as a foundation for unified image generation, leveraging their inherent ability to model temporal correlations. We introduce RealGeneral, a novel framework that reformulates image generation as a conditional frame prediction task, analogous to in-context learning in LLMs. To bridge the gap between video models and condition-image pairs, we propose (1) a Unified Conditional Embedding module for multi-modal alignment and (2) a Unified Stream DiT Block with decoupled adaptive LayerNorm and attention mask to mitigate cross-modal interference. RealGeneral demonstrates effectiveness in multiple important visual generation tasks; for example, it achieves a 14.5% improvement in subject similarity for customized generation and a 10% enhancement in image quality for the canny-to-image task. Project page: https://lyne1.github.io/RealGeneral/
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
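One plausible reading of the "decoupled adaptive LayerNorm" in the Unified Stream DiT Block described above is to give condition-frame and target-frame tokens separate modulation parameters, so the two streams do not share scale/shift statistics; the sketch below implements that reading. The module name, layer sizes, and conditioning signal are assumptions, not RealGeneral's actual block.

```python
import torch
import torch.nn as nn

class DecoupledAdaLN(nn.Module):
    """LayerNorm with separate adaptive scale/shift for condition vs. target tokens."""

    def __init__(self, dim, emb_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.mod_cond = nn.Linear(emb_dim, 2 * dim)    # modulation for condition-frame tokens
        self.mod_target = nn.Linear(emb_dim, 2 * dim)  # modulation for target-frame tokens

    def forward(self, x, t_emb, is_cond):
        # x: (B, N, dim) tokens, t_emb: (B, emb_dim) timestep/condition embedding,
        # is_cond: (N,) bool marking which tokens come from condition frames.
        h = self.norm(x)
        scale_c, shift_c = self.mod_cond(t_emb).chunk(2, dim=-1)
        scale_t, shift_t = self.mod_target(t_emb).chunk(2, dim=-1)
        mask = is_cond[None, :, None]                              # (1, N, 1) for broadcasting
        scale = torch.where(mask, scale_c[:, None], scale_t[:, None])
        shift = torch.where(mask, shift_c[:, None], shift_t[:, None])
        return h * (1 + scale) + shift
```

A companion attention mask would typically let target tokens attend to condition tokens while blocking the reverse direction, which is one common way to mitigate the cross-modal interference the abstract mentions.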
Tencent Hunyuan3D-1.0: A Unified Framework for Text-to-3D and Image-to-3D Generation
Yang, Xianghui, Shi, Huiwen, Zhang, Bowen, Yang, Fan, Wang, Jiacheng, Zhao, Hongxu, Liu, Xinhai, Wang, Xinzhou, Lin, Qingxiang, Yu, Jiaao, Wang, Lifu, Chen, Zhuo, Liu, Sicong, Liu, Yuhong, Yang, Yong, Wang, Di, Jiang, Jie, Guo, Chunchao
While 3D generative models have greatly improved artists' workflows, existing diffusion models for 3D generation suffer from slow generation and poor generalization. To address this issue, we propose a two-stage approach named Hunyuan3D-1.0, including a lite version and a standard version, both of which support text- and image-conditioned generation. In the first stage, we employ a multi-view diffusion model that efficiently generates multi-view RGB images in approximately 4 seconds. These multi-view images capture rich details of the 3D asset from different viewpoints, relaxing the task from single-view to multi-view reconstruction. In the second stage, we introduce a feed-forward reconstruction model that rapidly and faithfully reconstructs the 3D asset from the generated multi-view images in approximately 7 seconds. The reconstruction network learns to handle the noise and inconsistency introduced by the multi-view diffusion and leverages the available information from the condition image to efficiently recover the 3D structure. Our framework incorporates the text-to-image model Hunyuan-DiT, making it a unified framework that supports both text- and image-conditioned 3D generation. Our standard version has 3x more parameters than our lite version and other existing models. Hunyuan3D-1.0 achieves an impressive balance between speed and quality, significantly reducing generation time while maintaining the quality and diversity of the produced assets.
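The abstract above describes a two-stage pipeline; the toy sketch below only mirrors that data flow (multi-view diffusion followed by feed-forward reconstruction) with stub functions so the stages and their interface are explicit. All names, shapes, and dummy outputs are placeholders, not Hunyuan3D-1.0's API.

```python
import numpy as np

def multiview_diffusion_stage(prompt, n_views=6, size=64):
    """Stand-in for stage 1: generate n_views RGB renderings of the asset
    (about 4 seconds in the real system). Random images keep this runnable."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2 ** 32))
    return rng.random((n_views, size, size, 3))

def feedforward_reconstruction_stage(views, grid_res=32):
    """Stand-in for stage 2: map the (possibly inconsistent) multi-view images
    to a 3D representation (about 7 seconds in the real system). A dummy
    occupancy grid stands in for the reconstructed asset."""
    grid = np.zeros((grid_res, grid_res, grid_res))
    grid[grid_res // 4: 3 * grid_res // 4] = views.mean()
    return grid

def text_or_image_to_3d(prompt):
    views = multiview_diffusion_stage(prompt)        # stage 1: multi-view generation
    return feedforward_reconstruction_stage(views)   # stage 2: feed-forward reconstruction
```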
GenXD: Generating Any 3D and 4D Scenes
Zhao, Yuyang, Lin, Chung-Ching, Lin, Kevin, Yan, Zhiwen, Li, Linjie, Yang, Zhengyuan, Wang, Jianfeng, Lee, Gim Hee, Wang, Lijuan
[Figure 1: GenXD is a unified model for high-quality 3D and 4D generation from any number of condition images; by controlling the motion strength and condition masks, it supports various applications without modification.]
Recent developments in 2D visual generation have been remarkably successful. However, 3D and 4D generation remain challenging in real-world applications due to the lack of large-scale 4D data and effective model design. In this paper, we propose to jointly investigate general 3D and 4D generation by leveraging camera and object movements commonly observed in daily life. Due to the lack of real-world 4D data in the community, we first propose a data curation pipeline to obtain camera poses and object motion strength from videos. Based on this pipeline, we introduce a large-scale real-world 4D scene dataset: CamVid-30K. By leveraging all the 3D and 4D data, we develop our framework, GenXD, which allows us to produce any 3D or 4D scene. We propose multiview-temporal modules, which disentangle camera and object movements, to seamlessly learn from both 3D and 4D data. Additionally, GenXD employs masked latent conditions to support a variety of conditioning views. We perform extensive evaluations across various real-world and synthetic datasets, demonstrating GenXD's effectiveness and versatility compared to previous methods in 3D and 4D generation. The dataset and code will be made publicly available.
- North America > United States > Oklahoma > Beaver County (0.04)
- Asia > Singapore (0.04)
- Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)
- Europe > Belgium > Flanders > Flemish Brabant > Leuven (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- Europe > Switzerland > Zürich > Zürich (0.04)
- Europe > Germany > Saarland (0.04)
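The "masked latent conditions" mentioned in the GenXD abstract above can be pictured as a per-view condition tensor plus a binary mask channel, zero-filled wherever no condition image is given; the sketch below builds such a tensor for an arbitrary set of condition views. This is a common conditioning pattern, assumed here for illustration rather than taken from the GenXD implementation.

```python
import torch

def build_masked_latent_condition(cond_latents, cond_indices, n_frames):
    """Assemble a masked latent condition for an arbitrary set of condition views.

    cond_latents: (K, C, H, W) latents of the K available condition images.
    cond_indices: K frame/view indices at which those conditions apply.
    Returns (n_frames, C + 1, H, W): zeros at unconditioned positions plus a
    binary mask channel marking where condition content is present.
    """
    k, c, h, w = cond_latents.shape
    cond = torch.zeros(n_frames, c, h, w, dtype=cond_latents.dtype)
    mask = torch.zeros(n_frames, 1, h, w, dtype=cond_latents.dtype)
    for i, idx in enumerate(cond_indices):
        cond[idx] = cond_latents[i]
        mask[idx] = 1.0
    return torch.cat([cond, mask], dim=1)
```

Varying which indices carry conditions is what lets a single model cover different conditioning setups, consistent with the abstract's claim that different condition masks support various applications without modification.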
Pose-driven Attention-guided Image Generation for Person Re-Identification
Khatun, Amena, Denman, Simon, Sridharan, Sridha, Fookes, Clinton
Person re-identification (re-ID) concerns the matching of subject images across different camera views in a multi-camera surveillance system. One of the major challenges in person re-ID is pose variation across the camera network, which significantly affects the appearance of a person. Existing datasets lack adequate pose variations to carry out effective training of person re-ID systems. To solve this issue, in this paper we propose an end-to-end pose-driven attention-guided generative adversarial network to generate multiple poses of a person. We propose to learn and transfer the subject's pose through an attention mechanism. A semantic-consistency loss is proposed to preserve the semantic information of the person during pose transfer. To ensure that fine image details remain realistic after pose translation, an appearance discriminator is used, while a pose discriminator ensures that the pose of the transferred images exactly matches the target pose. We show that by incorporating the proposed approach in a person re-identification framework, realistic pose-transferred images and state-of-the-art re-identification results can be achieved.
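The abstract above names three training signals: adversarial feedback from an appearance discriminator, adversarial feedback from a pose discriminator, and a semantic-consistency loss. The sketch below combines them into a single generator objective; the exact loss forms, feature extractor, and weights are assumptions for illustration, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def generator_loss(fake_img, target_pose, src_feat, fake_feat, d_app, d_pose,
                   w_pose=1.0, w_sem=1.0):
    """Illustrative generator objective: appearance and pose adversarial terms
    plus a semantic-consistency term between source and generated features.

    d_app(img) and d_pose(img, pose) are the two discriminators (any callables
    returning logits); src_feat and fake_feat are semantic features (e.g., from
    a re-ID backbone) of the source and generated images.
    """
    app_logits = d_app(fake_img)
    pose_logits = d_pose(fake_img, target_pose)
    adv_app = F.binary_cross_entropy_with_logits(app_logits, torch.ones_like(app_logits))
    adv_pose = F.binary_cross_entropy_with_logits(pose_logits, torch.ones_like(pose_logits))
    semantic = F.l1_loss(fake_feat, src_feat)  # preserve identity/semantics across pose transfer
    return adv_app + w_pose * adv_pose + w_sem * semantic
```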
Soft-Gated Warping-GAN for Pose-Guided Person Image Synthesis
Dong, Haoye, Liang, Xiaodan, Gong, Ke, Lai, Hanjiang, Zhu, Jia, Yin, Jian
Despite remarkable advances in image synthesis research, existing works often fail to manipulate images under large geometric transformations. Synthesizing person images conditioned on arbitrary poses is one of the most representative examples, where the generation quality largely relies on the capability of identifying and modeling arbitrary transformations on different body parts. Current generative models are often built on local convolutions and overlook the key challenges (e.g., heavy occlusions, different views, or dramatic appearance changes) that arise when arbitrary pose manipulations cause distinct geometric changes in each body part. This paper aims to resolve these challenges, induced by geometric variability and spatial displacements, via a new Soft-Gated Warping Generative Adversarial Network (Warping-GAN), which is composed of two stages: 1) it first synthesizes a target part segmentation map given a target pose, which depicts region-level spatial layouts for guiding image synthesis with higher-level structure constraints; 2) the Warping-GAN, equipped with a soft-gated warping block, learns a feature-level mapping to render textures from the original image into the generated segmentation map. Warping-GAN is capable of controlling different degrees of transformation given distinct target poses. Moreover, the proposed warping block is lightweight and flexible enough to be injected into any network. Human perceptual studies and quantitative evaluations demonstrate the superiority of our Warping-GAN, which significantly outperforms all existing methods on two large datasets.
- North America > Canada > Quebec > Montreal (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Asia > China > Guangdong Province > Guangzhou (0.04)
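A soft-gated warping block of the kind described in the Warping-GAN abstract above can be pictured as: predict a flow field from source and segmentation features, warp the source features with it, and blend the warped texture in through a learned per-pixel gate. The sketch below follows that picture; layer sizes, the flow parameterization, and the gating form are assumptions, not the paper's exact block.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftGatedWarpingBlock(nn.Module):
    """Warp source-image features toward the target segmentation layout and
    blend them in through a learned soft gate."""

    def __init__(self, channels):
        super().__init__()
        self.flow = nn.Conv2d(2 * channels, 2, kernel_size=3, padding=1)  # per-pixel (dx, dy)
        self.gate = nn.Conv2d(2 * channels, 1, kernel_size=3, padding=1)  # per-pixel gate

    def forward(self, src_feat, seg_feat):
        # src_feat: source person-image features; seg_feat: features of the
        # synthesized target segmentation map; both (B, C, H, W).
        _, _, H, W = src_feat.shape
        x = torch.cat([src_feat, seg_feat], dim=1)
        flow = self.flow(x).permute(0, 2, 3, 1)                     # (B, H, W, 2) offsets
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                                torch.linspace(-1, 1, W), indexing="ij")
        grid = torch.stack([xs, ys], dim=-1).to(src_feat) + flow    # normalized sampling grid
        warped = F.grid_sample(src_feat, grid, align_corners=True)
        g = torch.sigmoid(self.gate(x))                             # soft gate in (0, 1)
        return g * warped + (1 - g) * seg_feat
```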